
llama : reduce useless copies when saving session #8916

Merged · 2 commits merged into master on Aug 9, 2024

Conversation

@compilade (Collaborator) commented on Aug 7, 2024

Should help with #8915.

Some useless copies into immediately discarded temporary buffers were introduced in #8699, because session size calculation and session file writing now share mostly the same code.

On CPU the speed was reasonable, but on CUDA, as reported in #8915, this made the session size calculation too slow.

To fix this, it's possible to simply avoid calling ggml_backend_tensor_get when the data won't be used (i.e. when calculating the session file size).

I've also eliminated the double tensor copies when saving the state to a buffer.
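For reference, a minimal sketch of the pattern (illustrative names, not the exact writer classes in llama.cpp): a size-only "dummy" writer counts bytes without ever calling ggml_backend_tensor_get, while the buffer writer copies tensor data straight into the destination buffer instead of going through a temporary.

```cpp
// Illustrative sketch only; the real llama.cpp writers differ in detail.
#include <cstddef>
#include <cstdint>
#include <cstring>

#include "ggml-backend.h" // for ggml_backend_tensor_get()

struct state_writer {
    virtual ~state_writer() = default;
    virtual void write(const void * src, size_t size) = 0;
    virtual void write_tensor_data(const struct ggml_tensor * tensor, size_t offset, size_t size) = 0;
    virtual size_t get_size_written() const = 0;
};

// Size calculation only: never touches tensor data, so the (potentially slow)
// device-to-host copy in ggml_backend_tensor_get is never performed.
struct state_writer_dummy : state_writer {
    size_t size_written = 0;
    void write(const void * /*src*/, size_t size) override { size_written += size; }
    void write_tensor_data(const struct ggml_tensor * /*tensor*/, size_t /*offset*/, size_t size) override {
        size_written += size;
    }
    size_t get_size_written() const override { return size_written; }
};

// Writes directly into the caller-provided buffer, so tensor data is copied once
// (device -> destination) instead of device -> temporary -> destination.
struct state_writer_buffer : state_writer {
    uint8_t * ptr;
    size_t    size_written = 0;

    explicit state_writer_buffer(uint8_t * dst) : ptr(dst) {}

    void write(const void * src, size_t size) override {
        // (real code should also bounds-check the destination buffer)
        memcpy(ptr, src, size);
        ptr += size; size_written += size;
    }
    void write_tensor_data(const struct ggml_tensor * tensor, size_t offset, size_t size) override {
        // copy straight into the output buffer, no intermediate allocation
        ggml_backend_tensor_get(tensor, ptr, offset, size);
        ptr += size; size_written += size;
    }
    size_t get_size_written() const override { return size_written; }
};
```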

@josharian does this help with your use-case?


TODO

  • Test state saving and restoring to and from a buffer
  • Test session file saving and restoring
  • Test sequence saving and restoring

@compilade added the performance, bugfix, and Review Complexity : Low labels on Aug 7, 2024
@josharian

This looks great. Thank you!

@josharian

OK, re-ran my profiling so I have real numbers to share.

This speeds up llama_state_seq_get_size by >25x, to the point that it is now legitimately cheap enough to use. :) Yay! Thank you!

This cuts roughly 5-6% off llama_state_seq_get_data, which is good, but not yet enough to make it usable in practice. I think batching up the tensor transfers (when contiguous) is probably the next thing to try out on this front.
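A rough sketch of that batching idea (hypothetical helper, not existing llama.cpp code): coalesce runs of contiguous cell indices into a single ggml_backend_tensor_get call, so the per-transfer overhead is paid once per run instead of once per cell.

```cpp
#include <cstdint>
#include <vector>

#include "ggml-backend.h" // for ggml_backend_tensor_get()

// 'cells' are element indices into 'tensor' along its first dimension, sorted ascending;
// 'row_size' is the size in bytes of one cell's data within that tensor.
static void read_cells_batched(const struct ggml_tensor * tensor,
                               const std::vector<uint32_t> & cells,
                               size_t row_size,
                               uint8_t * dst) {
    size_t i = 0;
    while (i < cells.size()) {
        // find the end of the current contiguous run of cell indices
        size_t j = i + 1;
        while (j < cells.size() && cells[j] == cells[j - 1] + 1) { ++j; }

        const size_t offset = (size_t) cells[i] * row_size;
        const size_t size   = (j - i) * row_size;
        // one transfer for the whole contiguous run
        ggml_backend_tensor_get(tensor, dst, offset, size);
        dst += size;
        i = j;
    }
}
```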

@compilade merged commit 345a686 into master on Aug 9, 2024 (53 of 54 checks passed)
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* llama : avoid useless copies in dummy session writer

* llama : avoid double tensor copy when saving session to buffer
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* llama : avoid useless copies in dummy session writer

* llama : avoid double tensor copy when saving session to buffer
Labels
bugfix · performance · Review Complexity : Low